Knowledge-light Letter-to-Sound Conversion for Swedish with FST and TBL
This paper describes exploratory attempts to apply a combination of finite state transducers (FST) and transformation-based learning (TBL, Brill 1992) to the problem of letter-to-sound (LTS) conversion for Swedish. Following Bouma (2000) for Dutch, we employ FST for segmentation of the textual input into groups of letters and for a first transcription stage; we feed the output of this step into a TBL system. With this setup, we reach 96.2% correctly transcribed segments with rather restricted means (a small set of hand-crafted rules for the FST stage; a set of 12 templates and a training set of 30kw for the TBL stage). Observing that quantity is the major error source and that compound morpheme boundaries can be useful for inferring quantity, we experimentally add high-precision, low-recall compound splitting based on graphotactic constraints. With this simple method, targeting only a subset of the compounds, performance improves to 96.9%.
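The two-stage idea in this abstract (a context-free first transcription, then learned correction rules) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the lookup table stands in for the FST stage, the single rule template ("change phoneme A to B when the next letter group is C") is only one conceivable template and not one of the paper's 12, and the toy words and simplified phoneme symbols are invented, not taken from the paper's data.

```python
# Hypothetical lookup table standing in for the FST stage:
# each letter group gets one default phoneme (vowels default to long).
TABLE = {"t": "t", "b": "b", "p": "p", "a": "a:", "k": "k", "ck": "k"}

# Toy training data: (letter groups, gold phonemes). Illustrative only.
CORPUS = [
    (["t", "a", "k"], ["t", "a:", "k"]),
    (["t", "a", "ck"], ["t", "a", "k"]),
    (["b", "a", "ck"], ["b", "a", "k"]),
]


def baseline(segments, table):
    """First-pass transcription: context-free lookup per letter group."""
    return [table[s] for s in segments]


def apply_rule(rule, segments, tags):
    """Rule (src, dst, nxt): rewrite src as dst when the next group is nxt."""
    src, dst, nxt = rule
    out = list(tags)
    for i in range(len(segments) - 1):
        if tags[i] == src and segments[i + 1] == nxt:
            out[i] = dst
    return out


def learn(corpus, table, max_rules=10):
    """Greedy TBL loop: repeatedly pick the rule with the best net gain."""
    current = [baseline(segs, table) for segs, _ in corpus]
    rules = []
    for _ in range(max_rules):
        # Instantiate the template at every position that is still wrong.
        candidates = set()
        for (segs, gold), tags in zip(corpus, current):
            for i in range(len(segs) - 1):
                if tags[i] != gold[i]:
                    candidates.add((tags[i], gold[i], segs[i + 1]))
        # Score each candidate: errors fixed minus errors introduced.
        best, best_score = None, 0
        for rule in sorted(candidates):
            score = 0
            for (segs, gold), tags in zip(corpus, current):
                new = apply_rule(rule, segs, tags)
                score += sum((n == g) - (t == g)
                             for n, t, g in zip(new, tags, gold))
            if score > best_score:
                best, best_score = rule, score
        if best is None:
            break  # no rule improves the training data any further
        rules.append(best)
        current = [apply_rule(best, segs, tags)
                   for (segs, _), tags in zip(corpus, current)]
    return rules


def transcribe(segments, table, rules):
    """Apply the baseline, then the learned rules in the order learned."""
    tags = baseline(segments, table)
    for rule in rules:
        tags = apply_rule(rule, segments, tags)
    return tags


RULES = learn(CORPUS, TABLE)
```

On this toy data the learner finds a single quantity rule that shortens the vowel before "ck", and `transcribe(["p", "a", "ck"], TABLE, RULES)` then applies it to an unseen word. A real system would use many templates and a far larger training set, as the abstract indicates.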
Digitizing Intangible Cultural Heritage
As part of the UNESCO project "Establishment of a National Inventory and Electronic Database of Lithuanian Intangible Cultural Heritage", the authors, representing the EU-funded project "European Cultural Heritage Online" (ECHO), were invited to give a course in digital archiving called "Digitizing Intangible Cultural Heritage" in Vilnius, Lithuania, from March 15 to 20, 2004. The present report briefly summarizes the sessions given. Analyses of the state of the digitization work at the participating institutes, together with recommendations for the future, are then given in a dedicated, stand-alone section.
A Multi-lingual Speech Corpus for Cognitive Research
We present the speech corpus SMALLWorlds (Spoken Multi-lingual Accounts of Logically Limited Worlds), newly established and still growing. SMALLWorlds contains monologic descriptions of scenes or worlds which are simple enough to be formally describable. The descriptions are instances of content-controlled monologue: semantically “pre-specified” but still bearing most hallmarks of spontaneous speech (hesitations and filled pauses, relaxed syntax, repetitions, self-corrections, incomplete constituents, irrelevant or redundant information, etc.) as well as idiosyncratic speaker traits. In the paper, we discuss the pros and cons of data elicited in this way. Following that, we present a typical SMALLWorlds task: the description of a simple drawing with differently coloured circles, squares, and triangles, with no hints given as to which description strategy or language style to use. We conclude with an example of how SMALLWorlds may be used: unsupervised lexical learning from phonetic transcription. At the time of writing, SMALLWorlds consists of more than 250 recordings in a wide range of typologically diverse languages from many parts of the world, some unwritten and endangered.